An Automated Biomedical Research Podcast Generation System
Author
UBMI-IFC Team
Published
November 19, 2025
Modified
November 19, 2025
Scraper
Retrieve all publications (2021-2025) from the IFC website.
Code
import json
import pandas as pd
import itables
from IPython.display import HTML, display

# Enable interactive mode for all DataFrames in the notebook
itables.init_notebook_mode(all_interactive=True)
# Silence itables typing warnings about the undocumented 'options' argument
itables.options.warn_on_undocumented_option = False

# Load the scraped JSON
try:
    with open('./data/raw/all_ifc_publications.json', 'r', encoding='utf-8') as file:
        publications_scraper = json.load(file)
    print(f"Scraper result: {len(publications_scraper)} publications")
except FileNotFoundError:
    print("Error: './data/raw/all_ifc_publications.json' does not exist.")
    publications_scraper = []
except json.JSONDecodeError:
    print("Error: invalid JSON format.")
    publications_scraper = []

# If there is data, convert it to a DataFrame and display it with itables
if publications_scraper:
    # Flatten nested JSON structures into plain columns
    df = pd.json_normalize(publications_scraper)
    # Show rows in random order
    df = df.sample(frac=1).reset_index(drop=True)

    # Inject CSS to truncate long text with an ellipsis in the rendered DataTable
    css = """
    <style>
    /* Target DataTables cells rendered by itables */
    table.dataTable td, table.dataTable th {
        max-width: 180px;
        white-space: nowrap;
        overflow: hidden;
        text-overflow: ellipsis;
    }
    /* Allow horizontal scrolling for very wide tables */
    div.dataTables_wrapper { overflow-x: auto; }
    </style>
    """
    display(HTML(css))

    # DataTables options: fix column widths and enable horizontal scroll
    dt_options = {
        'autoWidth': False,
        'columnDefs': [{'targets': '_all', 'width': '180px'}],
        'scrollX': True,
        'pageLength': 25,
    }
    itables.show(df, classes='stripe hover order-column', options=dt_options)
Scraper result: 404 publications
Expandir base de datos
We want to expand the database by searching PubMed for publications affiliated with "IFC, UNAM". To collect the institute's name variants, we look at how it is written in the articles themselves, so that the search can be improved.
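Once the name variants are known, they can be combined into a single PubMed query string using the `[Affiliation]` field tag. A minimal sketch (the `build_affiliation_query` helper and the variant list shown here are illustrative, not part of the pipeline):

```python
def build_affiliation_query(variants):
    """Combine affiliation spellings into one PubMed OR-query string."""
    return " OR ".join(f'"{v}"[Affiliation]' for v in variants)

# Illustrative variants; the real list comes from mining the PDFs below
variants = [
    "Instituto de Fisiología Celular",
    "Institute of Cellular Physiology",
    "IFC, UNAM",
]
print(build_affiliation_query(variants))
```

The resulting string can be passed as-is to PubMed's search box or to an E-utilities `esearch` call.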
Get PDFs
For this, we can use one of the 3 variants reported in 'notebooks/02_expand_database.ipynb'. Option B was used (export the publications to BibTeX, import them into Zotero, and use an extension that downloads the PDFs).
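The export step in option B amounts to rendering each publication record as a BibTeX `@article` entry. A minimal sketch, where `to_bibtex` and its field selection are hypothetical simplifications of the actual export:

```python
def to_bibtex(pub, key):
    """Render a publication dict as a minimal BibTeX @article entry."""
    wanted = ("title", "author", "journal", "year", "doi", "url")
    fields = {k: pub[k] for k in wanted if k in pub}
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@article{{{key},\n{body}\n}}"

entry = to_bibtex({"title": "Example", "year": "2021"}, "Example2021")
print(entry)
```

Zotero can then import the resulting `.bib` file directly (File → Import).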
Code
# Let's also check the actual BibTeX content
publicaciones_bibtex = './data/processed/all_ifc_publications.bib'

print("\n📄 Sample article in BibTeX:")
with open(publicaciones_bibtex, 'r', encoding='utf-8') as f:
    content = f.read()

# Show the first entry only
first_entry_end = content.find('\n}\n') + 3
print(content[:first_entry_end])
📄 Sample article in BibTeX:
@article{Abbruzzini2021_ifc_235,
abstract = {ABSTRACTPurpose: The production of Technosols is a sustainable strategy to reuse urban wastes and to regenerate degraded sites. However, little is known regarding the role of the activity of enzymes associated with carbon and nutrients cycling on organic degradation and microbial activity in these soils. Methods: A controlled experiment was conducted with Technosols made from construction wastes, wood chips, and compost or compost plus biochar, in order to evaluate their organic matter (OM) degradation potential and functioning through the activity of enzymes and microbial community composition. Results: The Technosols had organic carbon contents from 13 to 30 g kg−1, carbon-to-nitrogen ratio from 10 to 20, and available phosphorus from 92 to 376 mg kg−1. The Technosols with biochar and compost had alkaline pH and higher contents of organic carbon and available phosphorus compared to Technosols with compost alone. The mixture of wood chips and compost presented the highest enzyme activities, and might be the most appropriate for Technosol’s production. The mixture of concrete and excavation waste with compost and compost plus biochar displayed a potential for OM decomposition comparable to that of wood chips with compost plus biochar. These results suggest that the bacterial and archaeal fingerprint is similar among the Technosols, although differences are observed in the relative abundances of their taxa. Conclusions: Substrate composition affects the processes of OM transformation, microbial biomass activity, and composition. The mixture of wood chips and compost presented the highest enzyme activities during the incubation period, and might be the most appropriate for its application as a Technosol. The mixture of concrete and excavation waste with either compost or compost plus biochar displayed a potential for organic matter decomposition that was comparable to that of the mixture of wood chips with compost plus biochar. 
The microbial communities in these Technosols are not significantly different yet, but the bioavailability of nutrients derived from the changes in the soil matrix (by adding construction waste and biochar) is influencing soil enzymatic activity.},
author = {Abbruzzini and T. F. and Reyes-Ortigoza and A. L. and Alcántara-Hernández and R. J. and Mora, L. and Flores, L. and & Prado, B.},
doi = {10.1007/s11368-021-03062-2},
journal = {Journal of Soils and Sediments},
note = {Instituto de Fisiología Celular, UNAM},
title = {Chemical, biochemical, and microbiological properties of Technosols produced from urban inorganic and organic wastes},
url = {https://www.ifc.unam.mx/publicacion.php?scopus=85114778157},
year = {2021}
}
PDFs obtained:
Code
from pathlib import Path

dir_path = Path('papers/downloaded/zotero')
if not dir_path.exists():
    print(0)
else:
    # Count PDF files recursively
    count = sum(1 for p in dir_path.rglob('*') if p.is_file() and p.suffix.lower() == '.pdf')
    print(count)
345
Affiliation mining
Now we extract the affiliations from the text of the PDFs (using 'PyMuPDF') together with 'spaCy'.
We use regex and NLP to detect the affiliations
We consider English and Spanish ('en_core_web_sm' and 'es_core_web_sm')
From the affiliations, we will expand the PubMed search
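The regex side of this step can be sketched as follows (the patterns are illustrative; the PyMuPDF text extraction and the spaCy NER pass are omitted):

```python
import re

# Illustrative patterns for known spellings of the institute's name;
# the real pipeline mines these variants from the PDFs themselves
AFFILIATION_RE = re.compile(
    r"(Instituto de Fisiolog[íi]a Celular[^.\n]*"
    r"|Institute (?:of|for) Cellular Physiology[^.\n]*)",
    re.IGNORECASE,
)

def find_affiliations(text):
    """Return candidate affiliation strings found in raw PDF text."""
    return [m.group(0).strip() for m in AFFILIATION_RE.finditer(text)]

sample = "1 Instituto de Fisiología Celular, UNAM, Mexico City."
print(find_affiliations(sample))
```

Candidates found this way would then be confirmed (or extended) with spaCy's entity recognizer before being turned into search terms.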
Results:
Code
# Load the pre-filtered affiliation results
import json
import re


def clean_affiliation_for_search(term):
    """Clean an affiliation term for PubMed search."""
    if not term:
        return ""
    # Remove common noise patterns
    cleaned = re.sub(r'[•\d]+\s*', '', term)        # bullets and leading numbers
    cleaned = re.sub(r'[^\w\s\-,.]', ' ', cleaned)  # special chars except basic punctuation
    cleaned = re.sub(r'\s+', ' ', cleaned)          # normalize whitespace
    cleaned = cleaned.strip()
    # Remove very generic prefixes
    prefixes_to_remove = ['the ', 'a ', 'an ', 'at the ', 'from the ']
    for prefix in prefixes_to_remove:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
    # Skip very short or generic terms
    if len(cleaned) < 8:
        return ""
    generic_terms = ['university', 'institute', 'department', 'school', 'college']
    if cleaned.lower() in generic_terms:
        return ""
    return cleaned


def load_filtered_affiliations(min_score=15.0):
    """
    Load pre-filtered affiliation clusters for PubMed searches.

    Args:
        min_score: Minimum relevance score to include (default: 15.0 for high quality)

    Returns:
        List of affiliation terms optimized for PubMed searches
    """
    filtered_file = './data/processed/filtered_affiliations.json'
    print(f"📁 Loading affiliations: {filtered_file}")
    with open(filtered_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    clusters = data['relevant_affiliation_clusters']

    # Filter by score and extract search terms
    affiliation_terms = []
    for cluster in clusters:
        if cluster['relevance_score'] >= min_score:
            # Add the representative term
            representative = clean_affiliation_for_search(cluster['representative'])
            if representative:
                affiliation_terms.append(f'"{representative}"[Affiliation]')
            # Add top variations (limited to avoid too many terms)
            for variation in cluster['variations'][:3]:  # top 3 variations per cluster
                cleaned = clean_affiliation_for_search(variation)
                if cleaned and len(cleaned) > 10:  # only substantial terms
                    search_term = f'"{cleaned}"[Affiliation]'
                    if search_term not in affiliation_terms:  # avoid duplicates
                        affiliation_terms.append(search_term)

    n_clusters = len([c for c in clusters if c['relevance_score'] >= min_score])
    print(f"✅ Loaded {len(affiliation_terms)} search terms from {n_clusters} clusters")
    return affiliation_terms


# Load the filtered affiliations
print("🔍 Affiliations for the PubMed search")
print("=" * 60)
filtered_affiliations = load_filtered_affiliations(min_score=15.0)

print("\n📋 Top 10 terms:")
for i, term in enumerate(filtered_affiliations[:10], 1):
    print(f"{i:2d}. {term}")
🔍 Affiliations for the PubMed search
============================================================
📁 Loading affiliations: ./data/processed/filtered_affiliations.json
✅ Loaded 95 search terms from 50 clusters
📋 Top 10 terms:
1. "Instituto de Fisiología Celular"[Affiliation]
2. "Molecular Genetics, Instituto de Fisiología Celular"[Affiliation]
3. "Universidad Nacional Aut onoma de México"[Affiliation]
4. "Universidad Nacional Autónoma de México,"[Affiliation]
5. "Universidad Nacional Aut onoma de M exico"[Affiliation]
6. "Institute for Cellular Physiology"[Affiliation]
7. "Institute of Cellular Physiology"[Affiliation]
8. "Institute of Cellular Physiology at UNAM"[Affiliation]
9. "Department of Biochemistry and Structural Biology, Institute of Cellular Physiology"[Affiliation]
10. "Department of Biochemistry and Structural Biology,"[Affiliation]
After manually reviewing and cleaning the search terms found, we run the PubMed search and expand the database:
Code
# Compare counts between two JSON files and show a table with the raw JSON contents
from pathlib import Path
import json
import pandas as pd
import itables
from IPython.display import HTML, display

itables.init_notebook_mode(all_interactive=True)
itables.options.warn_on_undocumented_option = False


def safe_load_json(path):
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        print(f"⚠️ File not found: {path}")
        return None
    except json.JSONDecodeError:
        print(f"⚠️ Invalid JSON: {path}")
        return None


def count_records(obj):
    """Simple heuristic to count entries in a loaded JSON object."""
    if obj is None:
        return 0
    if isinstance(obj, list):
        return len(obj)
    if isinstance(obj, dict):
        # If the dict contains an obvious list of records, pick the largest list
        list_lengths = [len(v) for v in obj.values() if isinstance(v, list)]
        if list_lengths:
            return max(list_lengths)
        # Fallback: count the dict itself as 1 record (or 0 if empty)
        return 1 if obj else 0
    return 0


# Paths to compare (if the processed file does not exist, compare the raw file to itself)
path_a = Path('./data/raw/all_ifc_publications.json')
path_b = Path('./data/processed/pubmed_filtered_search_results.json')
if not path_b.exists():
    path_b = path_a

json_a = safe_load_json(path_a) or []
json_b = safe_load_json(path_b) or []
count_a = count_records(json_a)
count_b = count_records(json_b)
diff = count_b - count_a

# Print a short comparison summary
print(f"File A: {path_a} → {count_a} entries")
print(f"File B: {path_b} → {count_b} entries")
if path_a.samefile(path_b):
    print("Note: both paths point to the same file.")
print(f"Difference (B - A): {diff}")

# Show a small summary table with pandas + itables
summary_df = pd.DataFrame({
    'file': [str(path_a), str(path_b)],
    'entries': [count_a, count_b],
})
display(HTML("<h4>Count summary</h4>"))
itables.show(summary_df, classes='stripe hover order-column',
             options={'paging': False, 'searching': False})

# Finally, show the expanded JSON contents as a DataFrame for exploration
display(HTML("<h4>Exploring: contents of ./data/processed/pubmed_filtered_search_results.json</h4>"))


def extract_record_list(obj):
    """Return a list of records from common JSON shapes."""
    if obj is None:
        return []
    if isinstance(obj, list):
        return obj
    if isinstance(obj, dict):
        # Pick the largest list value if any (e.g. 'articles', 'items', 'results')
        list_values = [v for v in obj.values() if isinstance(v, list)]
        if list_values:
            return max(list_values, key=len)
        # Otherwise treat the dict itself as a single record
        return [obj]
    # Fallback: wrap a scalar into a list
    return [obj]


records = extract_record_list(json_b)
if not records:
    print("No records found in JSON to normalize.")
else:
    try:
        df_raw = pd.json_normalize(records)
    except Exception:
        # Last resort: wrap the whole object
        df_raw = pd.DataFrame(records)
    # Keep the display compact: randomize order and limit column width via CSS
    df_raw = df_raw.sample(frac=1).reset_index(drop=True)
    css = """
    <style>
    table.dataTable td, table.dataTable th {
        max-width: 180px;
        white-space: nowrap;
        overflow: hidden;
        text-overflow: ellipsis;
    }
    div.dataTables_wrapper { overflow-x: auto; }
    </style>
    """
    display(HTML(css))
    itables.show(df_raw, classes='stripe hover order-column',
                 options={'autoWidth': False,
                          'columnDefs': [{'targets': '_all', 'width': '180px'}],
                          'scrollX': True, 'pageLength': 25})
from pathlib import Path
from IPython.display import HTML, IFrame, display

viz_path = Path('./notebooks/notebooks/data/processed/embeddinggemma_visualization_a.html')
if viz_path.exists():
    try:
        # Try embedding raw HTML so interactive JS/CSS stays inline
        display(HTML(viz_path.read_text(encoding='utf-8')))
    except Exception:
        # Fall back to an iframe if direct embedding fails
        display(IFrame(src=str(viz_path), width='100%', height=700))
else:
    print(f"Visualization not found: {viz_path}")
Clusters
We used TF-IDF (Term Frequency-Inverse Document Frequency), a standard technique in natural language processing (NLP), to identify representative keywords and terms.
Note
Term frequency (TF): measures how many times a term appears in a document relative to the document's total length. This helps identify terms that are frequent within a document.
\[
\text{TF}(t) = \frac{\text{Number of times term } t \text{ appears}}{\text{Total number of terms in the document}}
\]
Inverse document frequency (IDF): measures how unique a term is across the corpus. If a term appears in many documents, its IDF will be low, since it is not distinctive.
\[
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
\]
TF-IDF: the product of TF and IDF. Terms with a high TF-IDF score are frequent within a document but rare in the rest of the corpus, which makes them representative.
\[
\text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t)
\]
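The formulas above can be implemented directly from their definitions. A minimal stdlib sketch on a toy corpus (the documents and whitespace tokenization are illustrative; the real pipeline runs over the publication abstracts):

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` over document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "calcium signaling in neurons".split(),
    "bacterial species and microbial host interactions".split(),
    "calcium channels regulate neuronal signaling".split(),
]
# "bacterial" appears in only one document, so it scores high there;
# "calcium" appears in two of three documents, so its IDF (and score) is lower
print(tf_idf("bacterial", corpus[1], corpus))
print(tf_idf("calcium", corpus[0], corpus))
```

In practice a library implementation such as scikit-learn's `TfidfVectorizer` would be used instead, which adds smoothing and normalization on top of these raw definitions.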
Code
from pathlib import Path
from IPython.display import HTML, IFrame, display
import re

viz_path = Path('notebooks/notebooks/data/processed/embeddinggemma_visualization_cluster.html')
if not viz_path.exists():
    print(f"Visualization not found: {viz_path}")
else:
    html_text = viz_path.read_text(encoding='utf-8')
    # Try to embed the full HTML inline (best for preserving JS/CSS)
    try:
        display(HTML(html_text))
    except Exception:
        # If inline embedding fails (often due to heavy JS), show a static table if present
        table_match = re.search(r'(<table[\s\S]*?>[\s\S]*?</table>)', html_text, flags=re.IGNORECASE)
        if table_match:
            display(HTML(table_match.group(1)))
        # Always provide an iframe fallback for the interactive visualization
        display(IFrame(src=str(viz_path), width='100%', height=800))
Code
from pathlib import Path
from IPython.display import HTML, IFrame, display
import re

html_path = Path('notebooks/notebooks/data/processed/embeddinggemma_cluster_summary.html')
if not html_path.exists():
    print(f"Visualization not found: {html_path}")
else:
    html_text = html_path.read_text(encoding='utf-8')
    table_html = None
    # Prefer BeautifulSoup if available for robust extraction
    try:
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html_text, 'html.parser')
        tables = soup.find_all('table')
        if tables:
            table_html = ''.join(str(t) for t in tables)
    except Exception:
        # Fall back to regex if BeautifulSoup is not installed
        m = re.search(r'(<table[\s\S]*?>[\s\S]*?</table>)', html_text, flags=re.IGNORECASE)
        if m:
            table_html = m.group(1)
    if table_html:
        # Wrap the table in a responsive container to allow horizontal scrolling
        display(HTML(f'<div style="overflow-x:auto">{table_html}</div>'))
    else:
        # If no table was found, embed the full HTML as an iframe
        display(HTML("<p>No &lt;table&gt; found in the visualization file. Showing full HTML below.</p>"))
        display(IFrame(src=str(html_path), width='100%', height=800))
cluster | size | top_terms | top_journal_example | year_range
0 | 135 | species, bacterial, host, bacteria, microbial, study